
Conversation

@SamuelOliveirads

For now, this is a proof of concept for batching the MTP KV cache update. My approach was to unify all MTP-related graph execution into the main llama_decode function, which already handles batching and ubatching efficiently.

The performance is currently not ideal, and my guess is that it's due to the overhead of building and executing the full, combined graph every time decode is called, especially since we now use it for both KV cache updates and single-token draft generation.

From here, I was thinking of two main options to optimize this:

  1. Work more inside llama_decode to make the unified graph approach more efficient. This would mean investigating why the graph cache isn't being reused and finding ways to avoid redundant computations when we only need the MTP-specific part of the graph.
  2. Create a separate llama_decode_mtp function. This would be a stripped-down version of decode dedicated to running only the MTP graph. This would likely require creating several new helper functions to be shared between both decode paths to avoid too much code duplication.

I'm more inclined towards the first option for the long run. As more models adopt MTP, it seems better to have a single, unified decode path that can handle it, rather than maintaining two separate ones.

Still, I'm not completely sure which path is best, so I wanted to share this draft to open a discussion, while I continue to work on ideas to optimize the unified graph approach.

@SamuelOliveirads
Author

Hey @F1LM1, just wanted to share some progress I've made on the batching issue.

I started by exploring the two options we discussed:

  • I first tried unifying the graphs. My benchmarks confirmed that graph creation itself is very fast, so the main bottleneck wasn't graph reuse. The real issue is that the unified approach makes the decode function quite bloated, and the performance problem lies elsewhere.
  • I also attempted a separate llama_decode_mtp, but it felt more complicated than I was expecting due to the need to share helper functions, so I returned to the idea of working inside the main decode function.

The last two commits should fix the major performance problem. Before, the logic was essentially running the full main model graph twice for MTP updates. Now, it properly reuses the state (embeddings and KV cache) from the main model's pass to feed the MTP pass, which saves a huge amount of time.

There is one known bug I still need to work on: the cache seems to get desynchronized on large prompts that are split across multiple ubatches. For example, when a 5000-token context is processed in 1500-token ubatches set by the user.

After the performance and this bug seem to be okay, I will refactor the changes inside decode to make them cleaner, as I don't like how invasive they are right now.

@F1LM1
Owner

F1LM1 commented Sep 23, 2025

Hey, lots of good stuff here that gets us much closer to a presentable final product, I think. On the other hand, I haven't had time to fully parse everything, but a couple of things look like bugs to me.

Prompt processing order of operations is wrong. We currently follow this sequence:

  • User provides prompt
  • Prompt is added to normal batch, prompt is also added to slot.mtp_kv_update_batch
  • Regular decode is run
  • MTP draft is run <- MTP layer KV cache is empty at this point!
  • MTP KV cache update is run

This one should be easy to fix.

A harder one is that KV cache seems to be getting written to the wrong cells. This also gets confounded by the first issue above. Let's say I have a 20-token prompt and I'm using GLM-4.5-Air (so layers 0-45 are the main model, and layer 46 is MTP). Then what happens is something like

  • the initial decode for the user prompt happens
    • the ubatch tells the model to write to KV cache slots 0-19 (this data is stored in sinfo)
    • this data gets written for layers 0-45 only
    • we sample a token
  • the first MTP decode gets run
    • the ubatch tells the model to write to KV cache slot 20
    • since we only run the MTP head on this step, only layer 46 gets populated
  • we go to decode the speculative batch (size 2)
    • the ubatch tells the model to write to KV cache slots 21-22
    • this gets written for layers 0-45 only
  • mtp_update_kv_cache gets called for all tokens so far
    • the last slot in KV cache gets removed (idx is too high?)
    • for some reason, mtp_update_kv_cache then tries to write over cache slot 22 again?

Then some combination of the above steps gets repeated. When I have time over the coming days I'll take a look in the debugger and see exactly what is happening, but I think the main crux of the issue is this alternating behavior where we write cache data for layers 0-45, then 46, then 0-45, then 46, etc.

This eventually degrades the output of both the main model and the MTP head, because both see random gaps in their respective KV caches; at least, it seems that way in my limited testing. The output of a simple one-word prompt "test" seems pretty different between the base implementation and this commit. The output from this commit isn't incoherent per se, so maybe it wouldn't set off any red flags that something was wrong, but its response does seem quite a bit shorter.

Re: this

> There is one known bug I still need to work on: the cache seems to get desynchronized on large prompts that are split across multiple ubatches. For example, when a 5000-token context is processed in 1500-token ubatches set by the user.

I don't know what exact behavior you were seeing, but it could be related to what I'm seeing above.

As for how to fix the second issue, we mostly just need to make sure the MTP update steps that update layer 46 KV cache get placed in the correct slots in KV cache, which is to say, get applied to the same slots we've already written for layers 0-45. This doesn't cause a conflict because KV cache is layer-specific. You can see this in build_attn(); the graph forward writes directly to KV cache on a per-layer basis:

    // store to KV cache
    {
        const auto & k_idxs = inp->get_k_idxs();
        const auto & v_idxs = inp->get_v_idxs();

        ggml_build_forward_expand(gf, mctx_cur->cpy_k(ctx0, k_cur, k_idxs, il));
        ggml_build_forward_expand(gf, mctx_cur->cpy_v(ctx0, v_cur, v_idxs, il));
    }

The way my original implementation does this is quite hacky: it's in this step, where I manually construct sinfos in llama_build_and_execute_mtp_graph

    std::vector<uint32_t> idxs;
    idxs.push_back(n_past);

    llama_kv_cache_unified::slot_info sinfo = {
        /*.s0   =*/ 0,
        /*.s1   =*/ 0,
        /*.strm =*/ { 0 },
        /*.idxs =*/ { idxs },
    };
    llama_kv_cache_unified::slot_info_vec_t sinfos;
    sinfos.push_back(sinfo);
    static_cast<llama_kv_cache_unified_context*>(mctx.get())->set_sinfos(sinfos);

to make sure that the MTP graph always writes over slot n_past in KV cache, where n_past is the value "remembered" by the mtp_kv_update_cache. In the previous example, n_past would be 0 through 19, so llama_build_and_execute_mtp_graph would write over layer 46 of KV cache for slots 0 through 19. Then I believe the memory context will handle the rest correctly (i.e. the next time you try to get a ubatch the normal way, it will point sinfo to slot 20 rather than 21). There's probably a more graceful way to do this as well.
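
For completeness, here's a hedged sketch of how the same trick could cover a whole prompt chunk rather than a single position. It reuses the types from the snippet above and my set_sinfos hook; n_past_start and n_past_end are placeholders, and it assumes cell index == position, so treat it as illustrative rather than tested code:

    // Illustrative only: one slot_info covering positions [n_past_start, n_past_end),
    // so the MTP graph overwrites the cells the main model already used for layers 0-45.
    // Assumes cell index == position, which only holds while nothing has been evicted.
    std::vector<uint32_t> idxs;
    for (uint32_t p = n_past_start; p < n_past_end; ++p) {
        idxs.push_back(p);
    }

    llama_kv_cache_unified::slot_info sinfo = {
        /*.s0   =*/ 0,
        /*.s1   =*/ 0,
        /*.strm =*/ { 0 },
        /*.idxs =*/ { idxs },
    };

    llama_kv_cache_unified::slot_info_vec_t sinfos = { sinfo };
    static_cast<llama_kv_cache_unified_context *>(mctx.get())->set_sinfos(sinfos);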

@SamuelOliveirads
Author

Hey @F1LM1, thanks for the analysis. It helped me a lot to better understand the problems.

I found an issue with the cache, specifically with positions when processing larger prompts. My guess was that something was wrong with the ubatch process, and fortunately, your analysis helped me find where the issue was.

I've implemented a small fix for the positions, but it will raise an error if you try to process something larger than your batch size. In my free time, I'm currently working on a fix for the MTP process, as it is incorrectly updating the cache positions.

One specific problem I'm addressing is that the mtp_update_kv_cache function is being activated in the generation draft using the wrong hidden state, which messes up the MTP to the point where predictions become more random. I will continue to work on this fix so we can have the proper quality of answers from the model.

@SamuelOliveirads
Author

@F1LM1 These were my last two commits to fix the crash that occurred when the prompt size exceeded the batch size, alongside other small fixes.

Over the last few days I have been (and still am) stuck on a problem where the MTP draft's checksum keeps changing. Contrary to my assumptions, this indicates the MTP process is polluting (or "poisoning") the main model's KV cache. The result is that small prompts show no obvious issues (the output is still coherent), but for large prompts (e.g., >5000 tokens) the output is terrible.

I tried many approaches; one was to block the "store to KV cache" logic inside build_attn specifically for the MTP draft path, but this failed as it broke the embedding buffer.

I will probably now try to clean up the code in an attempt to make it easier to debug and find the real issue and its fix.

Regarding the last two bugs you spotted, I believe I have fixed them with the changes to the MTP cache warmup and position handling logic.

If you find any other problems, please feel free to share. I believe the code is moving in the right direction, but using llama_context::decode adds another layer of complexity that is proving to be time-consuming.

@F1LM1
Owner

F1LM1 commented Oct 5, 2025

> @F1LM1 These were my last two commits to fix the crash that occurred when the prompt size exceeded the batch size, alongside other small fixes.
>
> Over the last few days I have been (and still am) stuck on a problem where the MTP draft's checksum keeps changing. Contrary to my assumptions, this indicates the MTP process is polluting (or "poisoning") the main model's KV cache. The result is that small prompts show no obvious issues (the output is still coherent), but for large prompts (e.g., >5000 tokens) the output is terrible.

I haven't looked at the most recent commits yet, will fire up the debugger and take a look tomorrow, but this sounds like the second issue I was describing last time; I'm not sure if I described it precisely. The MTP layer and the main model use the same memory context.

Roughly, the KV cells are an N x K array, where N is the number of tokens processed so far and K is the number of layers. This means that it's hard for the MTP KV cache to directly overwrite the same addresses as the main model's KV cache, since they exist in different "columns." However, if you're not careful with the ubatch construction, by default it will assume that you've written to all K columns whenever you write a new row (a new token), even if you've only written to one (the MTP column). This means you end up with

|    1   |    2   |    3   | ... | K (mtp) |  layers
| tok1   | tok1   | tok1   | ... |         |
|        |        |        | ... | tok1    |
| tok2   | tok2   | tok2   | ... |         |
|        |        |        | ... | tok2    |

whereas what you need is

|    1   |    2   |    3   | ... | K (mtp) |  layers
| tok1   | tok1   | tok1   | ... | tok1    |
| tok2   | tok2   | tok2   | ... | tok2    |
| tok3   | tok3   | tok3   | ... | tok3    |

where each row is a token, and layers 1 through K-1 are set during the main model call, and layer K is set when the MTP head is run. The KV cells are automatically written to in the forward step (if you open the kv cache object, you'll see where they get defined in the graph).

I vaguely recall that it was mctx.sinfos that determined which row gets written to, but I don't 100% remember, will have to look at it tomorrow.

@SamuelOliveirads
Author

I've been digging into the KV cache issue, and the problem is definitely related to how the MTP cache is being handled. I've made some progress and have a few key findings to share.

Based on your two previous messages, I started with two main hypotheses:

  1. The MTP KV cache warmup process for the initial prompt is broken.
  2. The MTP cache is being written to incorrect positions, leaving gaps or "dirty" cells for the main model layers, as you described.

My latest commits were aimed at fixing both. After a lot of debugging, here is what I could figure out.

Finding #1: Positional Integrity

> I haven't looked at the most recent commits yet, will fire up the debugger and take a look tomorrow, but this sounds like the second issue I was describing last time; I'm not sure if I described it precisely. The MTP layer and the main model use the same memory context.

First, I wanted to test your hypothesis about incorrect positions. I added detailed logging to trace every KV write operation for both the main model path and the MTP path.

I found a problem in this line:

    slot.mtp_kv_update_batch.push_back({ ids[i], slot.n_past + i, i });

The third parameter (i) was causing an issue with the token index, which I fixed by refactoring the logic to build a proper llama_batch:

    common_batch_add(accepted_batch, ids[i], slot.n_past + i, { slot.id }, false);

This seems to have fixed the problem with incorrect positions. The logs now confirm that our position management is solid. For any given chunk of tokens, the main model layers (e.g., 0-45) are written first, and then the MTP layer (e.g., 46) is written for the exact same token positions. There are no gaps.
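
For reference, the surrounding loop now looks roughly like this (a simplified sketch rather than the exact diff; accepted_batch and ids follow the snippet above):

    // Sketch: queue each accepted token at its true logical position for the
    // MTP cache update; logits are not needed for this cache-only pass.
    llama_batch accepted_batch = llama_batch_init((int32_t) ids.size(), 0, 1);
    common_batch_clear(accepted_batch);

    for (size_t i = 0; i < ids.size(); ++i) {
        common_batch_add(accepted_batch, ids[i], slot.n_past + (llama_pos) i, { slot.id }, false);
    }

    // ... run the MTP cache update with accepted_batch, then llama_batch_free(accepted_batch);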

Here's a summarized log example for a chunk of tokens at positions 4096-5117:

[KV_WRITE] path=MAIN, layer=0, ..., pos_start=4096, pos_end=5117
...
[KV_WRITE] path=MAIN, layer=45, ..., pos_start=4096, pos_end=5117

// Immediately followed by:
[MTP-UPDATE|PROMPT_WARMUP] Updating 1022 tokens...
...
[KV_WRITE] path=MTP, layer=46, ..., pos_start=4096, pos_end=5117

This proves the | tok1 | tok1 | ... | tok1 | structure is being correctly built. So it seems the issue you pointed out was there, but it should be fixed now.

Finding #2: MTP Cache Warmup

> • User provides prompt
> • Prompt is added to normal batch, prompt is also added to slot.mtp_kv_update_batch
> • Regular decode is run
> • MTP draft is run <- MTP layer KV cache is empty at this point!
> • MTP KV cache update is run

This is the strangest part. I fixed the warmup logic to correctly populate the MTP cache for the entire prompt, chunk by chunk, using the main model's fresh hidden states for each chunk. The logic in server.cpp now looks like this:

// Inside the main batch processing loop, right after the main llama_decode call
bool needs_mtp_warmup = false;
if (slot_batched && slot_batched->has_mtp) {
    if (slot_batched->state == SLOT_STATE_PROCESSING_PROMPT || slot_batched->state == SLOT_STATE_DONE_PROMPT) {
        needs_mtp_warmup = true;
    }
}

if (needs_mtp_warmup) {
    mtp_update_kv_cache(ctx, batch_view, true);
}

Logically, this should work. The MTP cache is now fully synchronized with the main model cache. However, when this logic is active, the model's output becomes incoherent and hallucinates on long prompts.

But, if I comment out this warmup block, the model's output becomes perfectly coherent and high-quality. As expected, the MTP's initial draft accuracy is low (~45%) because it has no context, but it improves over time as it populates its cache during generation.

Current Situation:

I'm now stuck on this paradox: a "correct" and full MTP cache warmup poisons the generation quality, while no warmup fixes the quality at the cost of initial MTP performance.

This leads me to believe there's a subtle, fundamental issue with how the MTP head's attention mechanism interacts with a long KV history. My sequential method of populating the cache might be creating a state that the MTP head wasn't trained to handle, even if the positions are correct.

I wanted to share these findings while they're fresh. I'll continue to investigate this behavior, but I'm very interested to hear if this triggers any new ideas on your end.

@F1LM1
Owner

F1LM1 commented Oct 7, 2025

Hi @SamuelOliveirads, good to see you diving in here!

From what you're saying, it looks like you're using

                        LLAMA_LOG_WARN("[KV_WRITE] path=MAIN, layer=%d, n_tokens=%d, pos_start=%d, pos_end=%d\n",
                            il, ubatch.n_tokens, ubatch.pos[0], ubatch.pos[ubatch.n_tokens-1]);

to determine which KV cells are being written to. But this is actually misleading. This only tells you which token positions are recorded in the ubatch.

Look carefully at the code in llama-kv-cache-unified.cpp. The tracking is actually done by a std::vector<llama_kv_cells_unified>. Each llama_kv_cells_unified tracks the positions in KV cache that are used by a given sequence. In simple testing there is only one sequence, which simplifies things. So in a llama_kv_cache_unified member function like seq_rm, the cells property is just v_cells[0]. We turn our attention to llama_kv_cells_unified next. This tracks a bunch of data about a given sequence and cell, like

  • which cells this sequence uses (used)
  • which token positions correspond to which cells (pos) <- these token positions are the ones you see in your debug code above as pos_start and pos_end!
  • other stuff

I claim that the current setup still does not write the layer 46 data into the same cells as the layer 0-45 data, despite the fact that we declare the data to be about the same positions. In other words, after processing 100 tokens, we see cells.used = 200, we see two entries in cells.pos corresponding to each of the 100 tokens, etc.

How does this arise? It's due to the default way the canonical operations manage ubatch creation/unified KV cache. In particular, these functions don't care about which token positions the ubatch is supposed to represent; they only know that KV cells should never be overwritten unless they were explicitly removed. So even if you create two batches with the same ubatch.pos, they still get written to different cells. To see this in the code, look first at llama_context::decode:
    mctx = memory->init_batch(*balloc, cparams.n_ubatch, output_all);
which in turn has:
    auto sinfos = prepare(ubatches);
For each ubatch, we call llama_kv_cache_unified::find_slot. This is the critical function that builds the slot_info whose member idxs determines which cells the ubatch's outputs will get written to. In this critical while loop, we see that find_slot will set idxs in order to the first N empty cells. However, if we've already built and decoded a ubatch for the main model only on the first 100 tokens, then cells.is_empty(i) for i = 0 to 99 will be false. Thus when we build the MTP ubatch, find_slot will return a slot_info with idxs ranging from 100-199 instead!
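
To make that concrete, here's a tiny standalone simulation of the first-fit behavior (plain C++, not the actual llama.cpp code): two passes over the same 100 token positions land in disjoint cells, because the allocator only looks at which cells are occupied.

    #include <cstdio>
    #include <vector>

    // Toy model of find_slot's first-fit allocation: it only knows which cells
    // are occupied, not which token positions a ubatch claims to represent.
    static std::vector<int> toy_find_slot(std::vector<bool> & used, int n_tokens) {
        std::vector<int> idxs;
        for (int i = 0; i < (int) used.size() && (int) idxs.size() < n_tokens; ++i) {
            if (!used[i]) {
                idxs.push_back(i);
                used[i] = true;
            }
        }
        return idxs;
    }

    int main() {
        std::vector<bool> used(4096, false);

        const std::vector<int> main_idxs = toy_find_slot(used, 100); // main-model pass (layers 0-45)
        const std::vector<int> mtp_idxs  = toy_find_slot(used, 100); // MTP pass over the same 100 positions

        // Prints "main: 0..99" and "mtp: 100..199": the same token positions end up
        // in different physical cells unless the second pass reuses the first pass's indices.
        std::printf("main: %d..%d\n", main_idxs.front(), main_idxs.back());
        std::printf("mtp:  %d..%d\n", mtp_idxs.front(),  mtp_idxs.back());
        return 0;
    }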

This is why my prototype built its own sinfos where I directly created and set idxs myself (intentionally set to be the same cell index that the layers 0-45 wrote into), then placed this sinfos directly into mctx, rather than using the canonical batch/ubatch creation functions; they are too good at accounting for our purposes.

I've verified this behavior using a debugger and it might be helpful for you to do so as well. Just place a breakpoint near the beginning of apply_ubatch and inspect sinfo.idxs. You could also just use a print loop, whatever you prefer. You'll see that for an initial batch of 100 tokens, sinfo.idxs = [0, 1, ..., 99], with update_mtp_kv=false, use_mtp_head=false; then another batch of 100 tokens with sinfo.idxs = [100, 101, ..., 199] with update_mtp_kv=true, use_mtp_head=true.

Also look at this->v_cells[0].pos. You'll see that

pos[0] = 0    // these are the cells written by the main model
pos[1] = 1
...
pos[99] = 99
pos[100] = 0    // these are the cells written by the MTP head
pos[101] = 1
...
pos[199] = 99

In other words, there are two KV cells associated with each token position, one for layers 0-45 and one for layer 46. But the main model forward just sees the entire KV cache including entries 0-200, so it sees 100 tokens of useful KV cache, and then 100 empty tokens, causing it to produce corrupted output. You should check to make sure I haven't just made a silly mistake/looking at the wrong commit or something, but I'm like 90% confident this is the issue.

FWIW, I don't think this behavior is completely isolated to the warmup step. It's creating doubled entries during token generation as well, but since the gaps are smaller (alternating between valid KV cache entries and empty ones) the main model quality is affected less.

If it turns out this is indeed the issue, a dirty fix is possible and maybe easy using my set_sinfos function. But to make this robust, you'd have to track which cells contain info for given token positions after the main model ubatch, and make sure that sinfos.idxs is set to those same positions for the MTP ubatch.

@SamuelOliveirads
Author

> From what you're saying, it looks like you're using
>
>     LLAMA_LOG_WARN("[KV_WRITE] path=MAIN, layer=%d, n_tokens=%d, pos_start=%d, pos_end=%d\n",
>         il, ubatch.n_tokens, ubatch.pos[0], ubatch.pos[ubatch.n_tokens-1]);
>
> to determine which KV cells are being written to. But this is actually misleading. This only tells you which token positions are recorded in the ubatch.

My bad, I indeed didn't notice that I was looking at the logical positions rather than the physical cells.

> I've verified this behavior using a debugger and it might be helpful for you to do so as well. Just place a breakpoint near the beginning of apply_ubatch and inspect sinfo.idxs. You could also just use a print loop, whatever you prefer. You'll see that for an initial batch of 100 tokens, sinfo.idxs = [0, 1, ..., 99], with update_mtp_kv=false, use_mtp_head=false; then another batch of 100 tokens with sinfo.idxs = [100, 101, ..., 199] with update_mtp_kv=true, use_mtp_head=true.

> Also look at this->v_cells[0].pos. You'll see that
>
>     pos[0] = 0    // these are the cells written by the main model
>     pos[1] = 1
>     ...
>     pos[99] = 99
>     pos[100] = 0    // these are the cells written by the MTP head
>     pos[101] = 1
>     ...
>     pos[199] = 99

Yes, I took a look and the problem was indeed there. It was due to my lack of deep knowledge on this part of the codebase. You pointed out several times to look at sinfo and I didn't investigate it properly, at least not to the full extent needed to understand the big picture.

> FWIW, I don't think this behavior is completely isolated to the warmup step. It's creating doubled entries during token generation as well, but since the gaps are smaller (alternating between valid KV cache entries and empty ones) the main model quality is affected less.

Yes, it's not. I oversimplified my findings before. I was noticing that a corruption was occurring, but it's subtle and difficult to see with a small context. By disabling the warmup, I was inadvertently masking the issue by starting with a small context, but the corruption was still happening. It just took more generation steps for the incorrect output to become obvious.

> If it turns out this is indeed the issue, a dirty fix is possible and maybe easy using my set_sinfos function. But to make this robust, you'd have to track which cells contain info for given token positions after the main model ubatch, and make sure that sinfos.idxs is set to those same positions for the MTP ubatch.

That's exactly the point I want to get to. Over the last few days, I implemented a fix based on your suggestion: I now store the sinfo from the main model's decode pass and force the MTP update (warmup or token accept) to reuse that same sinfo, bypassing find_slot.
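
Conceptually, the flow now looks like this (just a sketch of the idea; the capture helper and context names, e.g. get_sinfos, mctx_main, mctx_mtp, are placeholders rather than the actual names in my branch):

    // Sketch of the sinfo-reuse idea; the capture/inject helper names are placeholders.

    // 1) After the main model's decode, remember which physical cells its ubatch wrote to.
    llama_kv_cache_unified::slot_info_vec_t main_sinfos = mctx_main->get_sinfos(); // placeholder accessor

    // 2) Before the MTP warmup / accepted-token update for the same positions,
    //    inject those indices instead of letting find_slot pick fresh cells.
    static_cast<llama_kv_cache_unified_context *>(mctx_mtp.get())->set_sinfos(main_sinfos);

    // 3) Build and run the MTP graph as usual; build_attn now writes layer 46
    //    into the same cells that layers 0-45 already occupy.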

In terms of performance, it's the same as your original branch, at least in my configuration. However, after fixing the warmup, I could still spot some subtle randomness in the output, which led me to discover the second issue you predicted: duplicated entries during token generation.

Here is what the cells look like after one draft + validation cycle:

    Cell    9: Logical Position =    9 | Seq Info = [ID=0, Count=1] // << Written by MTP Draft
    Cell   10: Logical Position =    9 | Seq Info = [ID=0, Count=1] // << WRITTEN BY VALIDATION
    Cell   11: Logical Position =   10 | Seq Info = [ID=0, Count=1] // << Written by Validation

  1. The MTP draft head correctly generates a token for logical position 9 and writes its KV state to the next available physical cell, Cell 9. The cache is consistent at this point.
  2. The main model then runs validation for a batch of 2 tokens (the original at position 9 and the draft at position 10). It calls find_slot, which allocates the next two free cells, Cell 10 and Cell 11.
  3. The validation pass then writes the main model's KV state for logical position 9 into Cell 10.

This creates two cache entries for the same logical position, which corrupts the attention state.

I'm thinking of how to handle that. My best guess is to make the MTP draft's write to the cache a temporary one, and then purge its metadata immediately after using a function like llama_kv_cache_seq_rm. This way, when the validation pass calls find_slot, the cell used by the draft will be seen as free again, and the main model will correctly overwrite it, ensuring consistency.
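
Roughly, the idea would look like this (a sketch only; n_draft_pos is a placeholder for the logical position the draft just wrote, position 9 in the example above):

    // Sketch: treat the draft's KV write as temporary by purging its metadata,
    // so find_slot hands the same cell back to the validation pass.
    // n_draft_pos is a placeholder for the logical position the draft wrote.

    // 1) MTP draft runs; build_attn writes its K/V for n_draft_pos into the next free cell.

    // 2) Purge that entry so the cell is considered free again (the function mentioned above):
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/n_draft_pos, /*p1=*/n_draft_pos + 1);

    // 3) The validation decode then calls find_slot, gets the same cell back, and the
    //    main model's write for n_draft_pos overwrites the draft's entry.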

Other options that I considered were to allow find_slot and prepare to "reuse this cell and find another one for the rest," but this would probably create a lot of work and significantly change a core function. Another was to not allow build_attn to update cpy_k and cpy_v for the draft pass, but I know from previous attempts that this is not easy and can create bugs at a lower level.

@SamuelOliveirads
Author

@F1LM1 I believe this is now ready for another review, as I didn't find any more bugs or performance problems.

@SamuelOliveirads
Author

My last five commits were mainly focused on refactoring and cleaning up the code, while also fixing small bugs that I found along the way.

For now, I'm looking into performance, as this branch currently has similar speed to the original glm4-moe-mtp branch. One issue I've identified is a problem with incorrect graph reuse for each new interaction. This was not the initial goal of this branch, which was simply to enable batch processing.

If I don't find any more bugs, I will start to address the performance problems and apply fixes. This might be done in a new branch, depending on the scope of the required changes.

@SamuelOliveirads SamuelOliveirads marked this pull request as ready for review October 12, 2025 01:24
@F1LM1
Owner

F1LM1 commented Oct 12, 2025

> My last five commits were mainly focused on refactoring and cleaning up the code, while also fixing small bugs that I found along the way.
>
> For now, I'm looking into performance, as this branch currently has similar speed to the original glm4-moe-mtp branch. One issue I've identified is a problem with incorrect graph reuse for each new interaction. This was not the initial goal of this branch, which was simply to enable batch processing.
>
> If I don't find any more bugs, I will start to address the performance problems and apply fixes. This might be done in a new branch, depending on the scope of the required changes.

Sounds good to me, I'll do some testing tomorrow and merge this one. Probably agree that further performance improvements make the most sense in a separate PR.

@F1LM1
Owner

F1LM1 commented Oct 13, 2025

I haven't had the time to fully dive into the behavior today, but it feels vaguely like there might be an off-by-one somewhere after prompt processing. In particular, seems like the model isn't outputting a <think> token in my testing. Will dive deeper tomorrow.

@SamuelOliveirads
Author

> I haven't had the time to fully dive into the behavior today, but it feels vaguely like there might be an off-by-one somewhere after prompt processing. In particular, seems like the model isn't outputting a <think> token in my testing. Will dive deeper tomorrow.

In response to your findings, I ran a small test with about 20 requests and observed a 25% error rate where the model failed to use the <think> tag. This issue does not occur when the slot.has_mtp flag is deactivated.

For verification, I also ran another 20 requests on your branch and could not reproduce the problem there.

Interestingly, I found that using the arguments --jinja-prompt and --reasoning-format auto, as suggested in the closed issue #16439, seems to prevent the behavior from occurring.

I don't want to speculate too much, but my initial suspicion is that this could be related to my changes to the graph or the use of llama_kv_cache_seq_rm to clear tokens from the cache. Of course, it could be something else entirely.

I probably won't have time to look into this more deeply until tomorrow or later, but I would definitely appreciate it if you find anything in the meantime.

@SamuelOliveirads
Author

I've identified and fixed the bug.

The issue was that MTP cache operations (WARMUP and UPDATE_ACCEPTED) were overwriting the main model's valid logits inside llama_decode. This caused incorrect sampling, such as the failure to generate the <think> tag.

The fix prevents llama_decode from copying logits for these cache-only MTP operations, preserving the correct predictions from the main model.
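
Conceptually, the guard is something like this (a simplified sketch; the mtp_op tag and the copy_logits_for_ubatch helper are placeholders for how the branch actually marks these operations):

    // Sketch: skip the logits copy inside llama_decode for cache-only MTP passes,
    // so the main model's predictions used for sampling are left untouched.
    // The mtp_op tag and copy_logits_for_ubatch() helper are placeholders.
    const bool mtp_cache_only =
        params.mtp_op == MTP_OP_WARMUP ||
        params.mtp_op == MTP_OP_UPDATE_ACCEPTED;

    if (!mtp_cache_only) {
        copy_logits_for_ubatch(ctx, ubatch); // normal path: copy logits out of the backend buffer
    }
    // cache-only MTP operations fall through without overwriting the stored logits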

This also explains why --jinja-prompt seemed to help: it improved the MTP's predictions, coincidentally aligning the corrupted logits with the correct ones, but it didn't solve the underlying problem. The fix I've applied resolves the issue at its root.

@F1LM1
Owner

F1LM1 commented Oct 17, 2025

I agree, not seeing any obvious bugs in this revision in early testing. Hope to do some more thorough verification this weekend but tentatively think we should be happy with where this is now.

@InfernalDread

InfernalDread commented Oct 18, 2025

Hello everyone! First, let me say that I love the work you've both been doing on implementing MTP functionality for the GLM series! I just wanted to add my findings from testing this latest branch after your many fixes: I am getting better performance with a separate draft model than with the MTP layers alone, or with MTP + a separate draft model.

I am using GLM 4.5 Air at 4bit quant

Only Separate Draft Model: 21 tok/sec decoding

Only MTP Layers: 13.5 tok/sec decoding

MTP + Separate Draft Model: 15.6 tok/sec decoding

I am not entirely sure why that is, but I found it interesting enough to share.

Once again, thank you all for your continued efforts in making this a reality!

@SamuelOliveirads
Author

@InfernalDread Thanks for sharing this! Your benchmarks are very insightful.

I haven't tested it with a separate draft model yet. I assume you're using the one created by the community on Hugging Face? It's interesting that you were able to combine it with the MTP layers.

To give you some context on the current progress of MTP in llama.cpp, here is the roadmap I'm following:

  1. Initial Implementation (Proof of concept): Done (this is what you see in server: implement GLM-style MTP ggml-org/llama.cpp#15225).
  2. Architectural Refactor: Refactoring the code to align with the llama.cpp architecture. (We are here!)
  3. Performance Optimization: Making MTP worthwhile to use. I am currently studying this phase.
  4. Extensive Tests and Reviews.

Regarding performance, MTP is currently slower than not using it. This is because it doesn't yet leverage all the optimization features that llama.cpp offers. My immediate goal is to first match the baseline performance (without MTP), and then surpass it by at least 30% before considering it for a merge. Future follow-up PRs can then enhance it further, as MTP has many options for improvement.

Once I have the next PR ready with performance improvements, I would definitely appreciate some benchmarks. If you'd like, I can mention you when we have something ready to test so you can try it out and share your findings.

@InfernalDread

InfernalDread commented Oct 20, 2025

@SamuelOliveirads That is a very solid plan, and I am very impressed with how far you've both gotten in such a short amount of time! I genuinely can't wait until this project reaches its finale! To answer your first question: yes, I am using a draft model made by the community on HuggingFace, specifically this one: jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF; it works surprisingly well for its size! Also, thank you for the explanation; the fact that the performance I saw was without further optimizations/improvements is a very good sign of what is to come with MTP. Lastly, yes, I would love it if you could mention me when the next stage is ready for testing/benchmarking, as I would be glad to help in any way I can to confirm the changes.

Thank you once again!

@F1LM1 F1LM1 merged commit c2d7c76 into F1LM1:glm4-moe-mtp Oct 20, 2025
@F1LM1
Owner

F1LM1 commented Oct 20, 2025

Didn't find suspicious behavior, so merging. Let's keep discussing performance bottlenecks in another PR? (I'll be away from my computer for the next week or so but will keep checking in here)

@SamuelOliveirads SamuelOliveirads deleted the glm4-mtp-batch branch October 20, 2025 19:04